This tutorial is based on two sources:

  1. https://hbctraining.github.io/Training-modules/IntroR/ by Meeta Mistry, Mary Piper, and Radhika Khetani

  2. https://rspatial.org/raster/sdm/index.html by Robert J. Hijmans and Jane Elith.

Part 1. How to use this tutorial document

This file should have opened in a web browser window. It doesn’t run anything in R by itself; instead you will need to copy and paste (or retype) commands from it.

Whenever you see something like this,

print("hello world") # comments look like this; you don't have to copy them
## [1] "hello world"

the tutorial will display two code boxes.

  • The box without the two hash characters (##) contains the command, which is the text that you will run in R. To run something in R, simply copy the text in the gray code box into your console and press Return.

    • Note: Inside the command, anything that starts with a single # and is displayed in gray is just a comment; you don’t have to run it, but it doesn’t hurt anything if you copy and paste it into R.
  • The box with the ## is the result, which should correspond to what will be printed in your console window.

Part 2. Getting started with RStudio

Find the file “installscript.R” in the class network directory where you found this tutorial, and choose RStudio to open it in the RStudio program. RStudio is a development environment for R, which means it provides a graphical interface for writing code in the R programming language.

The RStudio interface has four main panels:

  • Console: where you can type commands and see output. The console is all you would see if you ran R in the command line without RStudio. It’s the green box in the image below.

  • Script editor: where you can type out commands and save to file. You can also submit the commands to run in the console. Right now it’s showing “installscript.R”. It’s the red box in the image below.

  • Environment/History: the Environment tab shows all active objects, and the History tab keeps track of all commands run in the console. It’s the blue box in the image below.

  • Files/Plots/Packages/Help: several different tabs that show the active directory, plots, installed packages (more about this later), help files, etc. It’s the yellow box in the image below.

Screenshot of the RStudio panels

First you’re going to use the script editor (top left panel). Click “Source” in the top right of this panel to run the “installscript.R” script, which will install all the libraries you’ll need for this course. While it’s running, look at the other panels too.

Next find the console (the bottom left panel). When you type something into this command-line interface and hit Enter, the text you entered will be run in the R processor and the results will be returned. Right now you’ll see a bunch of text scrolling by as it installs the packages.

The command prompt

Reading the command prompt helps you understand when R is ready to accept commands. The list below describes the different states of the command prompt and how you can exit a command:

  • Console is ready to accept commands: >

If R is ready to accept commands, the R console shows a > prompt. Can you find this on your own screen?

When the console receives a command (either typed directly into the console or sent from the script editor with Ctrl-Enter), R will try to execute it. After running, the console will show the results and come back with a new > prompt to wait for new commands.

  • Console is waiting for you to enter more data: +

If the code sent to the console isn’t a complete command yet, the console shows a + prompt while R waits for you to enter the rest. Often this happens because you haven’t ‘closed’ a parenthesis or quotation.

If you can’t figure out why your command isn’t running, you can click inside the console window and press the Escape key to escape the command and bring back a new prompt >; then you can start over sending the command.

Using the command prompt

Once your libraries are finished installing and it shows a command prompt, run the following command by typing or pasting it into the console and hitting Enter:

getwd()
## [1] "/Users/jblois/Documents/GitHub/biodata_shortcourse/development"

This shows your current working directory, that is, where you are in the computer’s file structure. (It will NOT look like the result above.) If you look in the “Files” tab in the bottom right panel, you will see all the files and folders in this directory, which you can also list using the following command:

dir()
##  [1] "biodata_BobcatSTEM.Rproj" "Blois_Day1.RData"        
##  [3] "climate"                  "course_overview.html"    
##  [5] "course_overview.Rmd"      "data-cleaning.R"         
##  [7] "day1_tutorial.html"       "day1_tutorial.Rmd"       
##  [9] "day2_tutorial.Rmd"        "day3_tutorial.Rmd"       
## [11] "fix-paleoclimate.R"       "gbif-download.R"         
## [13] "images"                   "neotoma-download.R"      
## [15] "neotoma-raw.RData"        "species-range.R"

Try some other stuff to see how this works.

9+6 #you can just use it as a calculator
## [1] 15
sum(9,6) #you can also use functions instead of arithmetic symbols. In this case, the word "sum" indicates the function, which is acting on the values within the parentheses.
## [1] 15

Now, try something deliberately wrong. Copy and paste this line of code into your console, then press Enter:

9+6+ 

If you look at your console, you will see that instead of an answer (15), you see the + prompt underneath a line of code that says 9+6+. To complete the expression, type a 0 after the + within the console and press Enter. You have now ‘closed’ the line of code and gotten your answer.

Remember, you can always click inside the console window and press the Escape key to escape the mistake and bring back a new prompt >; then you can start over sending the command.

The script editor

Now try the script editor (top left window in RStudio). Open up a new script by navigating to File –> New File –> R script. Once you have a blank script open, paste in the following:

# I am adding 3 and 5!
3 + 5

It didn’t run, because you wrote it in the script and not in the console. Highlight the pasted text within your script and hit Ctrl+Enter (or click Run in the top right corner of the pane): the highlighted text will be sent to the console and your result will appear.

This is useful for when you need to run the same command multiple times, such as when you’re trying to get something right – that’s why it’s called the “editor”. You should make a habit of writing your commands in the code editor instead of the console, because then you can easily go back to your script later to see exactly how you did it.

Syntax

Notice that the statement “I am adding 3 and 5!” in your script started with the comment symbol, #. What happens if we run that same line without the #? Remove the # sign at the front and re-run it: I am adding 3 and 5! Now R tries to run that sentence as a command, and it doesn’t work. The console shows the error Error: unexpected symbol in "I am", which means that the R interpreter did not know what to do with that command. Things sent to the console won’t work unless they are properly constructed commands in the R language.

Use the # character to insert comments about what your code is doing. This, again, makes it easier to understand your own work later.

Assignment operator

To do useful and interesting things in R, we need to assign values to variables using the assignment operator, <-. For example, we can use the assignment operator to assign the value of 3 to a variable named x by running:

x  <-  3

The assignment operator (<-) assigns values on the right to variables on the left.

Variables

A variable in computer programming is a symbolic name for a location where information can be maintained and referenced. You can think of a variable like a “bucket” of information with a label on the outside. When referring to the bucket of information, we use the label on the bucket (the variable name), not the data stored in the bucket (the value).

In the example above, we created a variable or a ‘bucket’ called x. Inside we put a value, 3.

Let’s create another variable called y and give it a value of 5.

y  <-  5

When assigning a value to a variable, R does not print anything to the console. You can tell it to print the value by typing the variable name:

y
## [1] 5

You can also view information on all the currently stored variables by looking in your Environment window in the upper right-hand corner of the RStudio interface.

Now we can reference these buckets by name to perform mathematical operations on the values contained within. What do you get in the console for the following operation?

x+y
## [1] 8

Try assigning the result of this operation to another variable called result.

result  <-  x + y
result
## [1] 8

Practice:

  1. Change the value of the variable x to 5 using the assignment operator. What happens to result? Does it change?
  2. Now try changing the value of variable y to contain the value 10. What do you need to do to update the variable result to the new value of x + y? Show your results to an instructor.

Tips on variable names

Variables can be given almost any name, such as x, current_temperature, or subjectID. However, there are some rules / suggestions you should keep in mind:

  • R is case sensitive (e.g., X is different from x)
  • Variable names can’t start with a number (2x is not valid but x2 is)
  • You can’t use R’s reserved words (e.g., if, else, for) as variable names. In general, even when it’s allowed, it’s best not to reuse the names of existing functions or built-in objects (e.g., c, T, mean, data).

    • You can type ? followed by the name to see if the name is already in use by a built-in function.
  • Use short variable names; longer names = more typos.
  • Before you assign a new variable, check in the Environment tab to make sure you didn’t already use the name.
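
The rules above can be tried out directly in the console. This is only a small sketch: the names temp, Temp, and x2 are made up for illustration and are not used elsewhere in this tutorial.

```r
# Hypothetical variable names, used only to illustrate the naming rules
temp <- 20
Temp <- 25        # a separate variable, because R is case sensitive
temp == Temp      # FALSE: the two variables hold different values

x2 <- 10          # valid: the name starts with a letter
# 2x <- 10        # invalid: R would stop with an error at this name

exists("Temp")    # TRUE: this name is already in use
```

The exists() function is another way to check whether a name is already taken, in addition to looking in the Environment tab.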

Data Storage

Data Types

Variables can contain values of specific types within R. The most common basic data types in R include:

  • "numeric" for any numerical value
  • "character" for text values, denoted by using quotes ("") around the value
  • "logical" for TRUE and FALSE (the boolean data type)

The table below provides examples of each of the commonly used data types:

Data type    Examples
Numeric      1, 1.5, 20, pi
Character    "anytext", "5", "TRUE"
Logical      TRUE, FALSE, T, F
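
If you are not sure which data type a value has, you can ask R. The sketch below uses the base R function class(), which this tutorial has not introduced yet, on a few of the example values from the table.

```r
class(1.5)      # "numeric"
class(pi)       # "numeric"
class("5")      # "character": the quotes make it text, not a number
class(TRUE)     # "logical"
class("TRUE")   # "character": again, the quotes make it text
```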

Data Structures

We know that variables are like buckets, and so far we have seen that bucket filled with a single value. Even when result was created, the outcome of the mathematical operation was a single value. Variables can store more than just a single value: they can store a multitude of different data structures. These include, but are not limited to, vectors (c), factors (factor), matrices (matrix), data frames (data.frame) and lists (list).

Vector

A vector is the most common and basic data structure in R, and is pretty much the workhorse of R. It can be constructed with the combine command, c(). It’s basically just a collection of values, mainly either numbers,

c(1, 40, 9, 22)
## [1]  1 40  9 22

or characters,

c("a", "b", "c", "q")
## [1] "a" "b" "c" "q"

or logical values.

c(TRUE, TRUE, FALSE, TRUE)
## [1]  TRUE  TRUE FALSE  TRUE

Note that all values in a vector must be of the same data type. If you try to create a vector with more than a single data type, R will try to coerce it into a single data type. For example, if you were to try to create the following vector:

c("a", 9, 12, TRUE)
## [1] "a"    "9"    "12"   "TRUE"

R forces (“coerces”) all the values to character type, which is why every element in the result above appears in quotes.
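
Coercion also happens when no text is involved. The following sketch (the names mixed_num and mixed_chr are made up for this example) shows that a logical value mixed with numbers becomes a number, while anything mixed with text becomes text.

```r
# Hypothetical vectors, used only to illustrate coercion
mixed_num <- c(9, 12, TRUE)   # TRUE is coerced to the number 1
mixed_num                     # 9 12 1
class(mixed_num)              # "numeric"

mixed_chr <- c("a", 9, TRUE)  # everything is coerced to text
class(mixed_chr)              # "character"

as.numeric("12") + 1          # 13: text can be converted back explicitly
```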

The analogy for a vector is that your bucket now has different compartments; these compartments in a vector are called elements. Each element contains a single value, and there is no limit to how many elements you can have. A vector is assigned to a single variable, because regardless of how many elements it contains, in the end it is still a single bucket.

Let’s create a vector of specimen counts and assign it to a variable called specCounts. Run the following lines:

specCounts  <-  c(3000, 50000, 46)
specCounts
## [1]  3000 50000    46

Each element of this vector contains a single numeric value, and the three values are combined together into a vector using c() (the combine function). All of the values are put within the parentheses and separated with commas.

Looking in your Environment tab, you can see that the specCounts variable you just created is numeric, starts at element 1 and ends at element 3 (i.e. it’s a vector containing 3 numeric values).

A vector can also contain characters. Run the following code to create another vector called species with three elements, where each element corresponds with the previous vector.

species <- c("crocodile", "trout", "panda")
species
## [1] "crocodile" "trout"     "panda"

Matrix

A matrix in R is a collection of vectors of the same length and type. Vectors can be combined as columns in the matrix or by row, to create a 2-dimensional structure.

Matrices are commonly used as part of the mathematical machinery of statistics. We don’t create them manually very often, but they’re very commonly used inside R functions. They are usually of numeric data type. Because every element of a matrix must have the same data type, R does not stop with an error when the input mixes types (numeric, character, etc.); instead, it coerces all the values to a single type, just as it does for vectors.
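
Even though you will rarely build matrices by hand, a small example may help make the idea concrete. This sketch uses the base R functions cbind(), rbind(), and matrix(); the vector names a and b are made up for illustration.

```r
# Two hypothetical vectors of the same length and type
a <- c(1, 2, 3)
b <- c(4, 5, 6)

by_col <- cbind(a, b)        # each vector becomes a column
by_row <- rbind(a, b)        # each vector becomes a row
dim(by_col)                  # 3 2 (rows, columns)
dim(by_row)                  # 2 3

m <- matrix(1:6, nrow = 2)   # matrix() fills column by column
m[2, 3]                      # 6: the value in row 2, column 3
```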

Data Frame

A data.frame is the most common data structure in R for storing data in tables, and it’s what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors).

Used systematically, data frames make data analysis easier.

We can create a dataframe by bringing vectors together to form the columns. We do this using the data.frame() function. We give the function the different vectors we would like to bind together, and it creates the data frame. This function will only work for vectors of the same length.

df <- data.frame(species,specCounts)
df
##     species specCounts
## 1 crocodile       3000
## 2     trout      50000
## 3     panda         46

You can see that there are two columns, each one containing one of the input vectors.
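
Once the data frame exists, you can ask R about its shape and pull a column back out. The sketch below rebuilds the same data frame so that it runs on its own; it uses the $ operator and the base R functions names(), nrow(), and ncol(), none of which have been introduced above.

```r
# Rebuild the data frame from the two vectors
species <- c("crocodile", "trout", "panda")
specCounts <- c(3000, 50000, 46)
df <- data.frame(species, specCounts)

names(df)        # "species" "specCounts": the column names
nrow(df)         # 3 rows
ncol(df)         # 2 columns
df$specCounts    # 3000 50000 46: one column, returned as a vector
```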

List

Lists are a data structure in R that can be perhaps a bit daunting at first, but soon become amazingly useful. A list is a data structure that can hold any number of any types of other data structures, one after another.

If you have variables of different data structures you wish to combine, you can put all of those into one list object by using the list() function and placing all the items you wish to combine within parentheses.

Run the following to construct a list called “list1” that contains three of the variables we’ve created so far in this tutorial.

list1 <- list(result, species, specCounts)
list1
## [[1]]
## [1] 8
## 
## [[2]]
## [1] "crocodile" "trout"     "panda"    
## 
## [[3]]
## [1]  3000 50000    46

There are three components corresponding to the three different variables we passed in, and what you see is that the structure of each is retained.
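
To get a single component back out of a list, you can use double square brackets with the component's position. This is a minimal sketch that recreates the three variables so it runs on its own.

```r
# Recreate the variables and the list
result <- 8
species <- c("crocodile", "trout", "panda")
specCounts <- c(3000, 50000, 46)
list1 <- list(result, species, specCounts)

length(list1)       # 3 components
list1[[2]]          # the species vector, unchanged
class(list1[[2]])   # "character": the original type is retained
list1[[3]][2]       # 50000: second element of the third component
```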

Functions

A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.

The general usage for a function is the name of the function followed by parentheses:

function_name(input)

The input(s) are called arguments, which can include:

  1. the data structure or data structures on which the function operates
  2. specifications that alter the way the function operates

Not all functions take arguments, for example:

getwd()

However, most functions take one or more arguments. If you don’t specify a required argument when calling the function, you will receive an error. Other arguments are optional: if you don’t include them, the function will fall back on using a default. The defaults represent standard values that the author of the function specified as being “good enough in standard cases”, but if you want something specific, simply change the argument to the value of your choice.
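
Here is a small sketch of required and optional arguments. It uses the base R function paste(), whose sep argument defaults to a single space, and a made-up function called raise() that is not part of R and is defined here only for illustration.

```r
# paste() joins text; the optional argument sep defaults to " "
paste("Iris", "setosa")              # "Iris setosa"
paste("Iris", "setosa", sep = "_")   # "Iris_setosa"

# A hypothetical function: x is required, power is optional (default 2)
raise <- function(x, power = 2) {
  x^power
}

raise(3)              # 9: falls back on the default power = 2
raise(3, power = 3)   # 27: the default is overridden
# raise()             # would stop with an error, because x has no default
```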

Basic functions

We have already used a few examples of basic functions in the previous lessons, e.g., getwd(), c(), and data.frame(). These functions are available as part of R’s built-in capabilities, and we will explore a few more of these base functions below.

You can also get functions from external packages or libraries, or even write your own.

Let’s revisit the function c() that we have used previously to combine data into vectors. The arguments it takes are a collection of numbers, characters or strings (separated by a comma). The c() function performs the task of combining all the numbers or characters provided as arguments into a single vector. You can also pass an existing vector as one of the arguments in order to add elements to it:

specCountsLonger <- c(900,specCounts) #adds the new value at the beginning 
#or
specCountsLonger <- c(specCounts,900) #adds the new value at the end

What happens here is that we take the original vector specCounts (containing three elements), and add another item to one end. You can imagine doing this over and over again to build a vector.
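
To see the difference between the two versions, you can store them under separate names and print them. In this sketch the names atStart and atEnd are made up for illustration.

```r
specCounts <- c(3000, 50000, 46)

atStart <- c(900, specCounts)   # new value goes first
atEnd   <- c(specCounts, 900)   # new value goes last

atStart              # 900 3000 50000 46
atEnd                # 3000 50000 46 900
length(atEnd)        # 4
length(specCounts)   # 3: the original vector is unchanged
```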

Since R is used for statistical computing, many of the base functions involve mathematical operations. If interested, we have linked a detailed guide for performing basic statistical tests in R. One example of a base R mathematical function is sqrt(). The input/argument must be a number, and the output is the square root of that number. Let’s try finding the square root of 81:

sqrt(81)
## [1] 9

Now what would happen if we called the function (i.e., ran the function) on a vector of values instead of a single value?

sqrt(specCounts)
## [1]  54.77226 223.60680   6.78233

In this case the function was called on each individual value of the vector specCounts and the respective results were displayed. Beware: this does not work with every function!
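
As a rough illustration of the difference, the sketch below compares an operation that works element by element with a few base R functions that instead collapse the whole vector into a single value.

```r
specCounts <- c(3000, 50000, 46)

# Element by element: one result per input value
specCounts * 2        # 6000 100000 92

# Whole vector at once: a single result
length(specCounts)    # 3
sum(specCounts)       # 53046
max(specCounts)       # 50000
```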

Let’s try another function, this time one with options (arguments that change the behavior of the function) that we can adjust, for example round:

round(3.14159)
## [1] 3

We can see that we get 3. That’s because the default is to round to the nearest whole number. What if we want to keep a certain number of decimal places? How would we change the default?

Seeking help on arguments for functions

The best way of finding out this information is to use the help operator ? followed by the name of the function. Doing this will open up the help manual in the bottom right panel of RStudio that will provide a description of the function, usage, arguments, details, and examples:

?round

If you scroll through the help file for the function, you will see a lot of detail: different but related functions (e.g., ceiling); Usage (which also lists the default values); Arguments; Details; and Examples.

You can also use the example() function to run the examples from the help file. (This one has a lot of examples!)

example(round)
## 
## round> round(.5 + -2:4) # IEEE / IEC rounding: -2  0  0  2  2  4  4
## [1] -2  0  0  2  2  4  4
## 
## round> ## (this is *good* behaviour -- do *NOT* report it as bug !)
## round> 
## round> ( x1 <- seq(-2, 4, by = .5) )
##  [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0
## 
## round> round(x1) #-- IEEE / IEC rounding !
##  [1] -2 -2 -1  0  0  0  1  2  2  2  3  4  4
## 
## round> x1[trunc(x1) != floor(x1)]
## [1] -1.5 -0.5
## 
## round> x1[round(x1) != floor(x1 + .5)]
## [1] -1.5  0.5  2.5
## 
## round> (non.int <- ceiling(x1) != floor(x1))
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [13] FALSE
## 
## round> x2 <- pi * 100^(-1:3)
## 
## round> round(x2, 3)
## [1]       0.031       3.142     314.159   31415.927 3141592.654
## 
## round> signif(x2, 3)
## [1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06

If you are already familiar with the function but just need to remind yourself of the names of the arguments, you can use:

str(round)
## function (x, digits = 0, ...)

This tells us that we can change the number of digits returned by adding an optional argument. We can type digits = 2 or however many we may want:

round(3.14159, digits = 2)
## [1] 3.14
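
A few more calls, shown here only as a sketch, may help you compare round() with the related functions listed in its help file. Note that digits in round() counts decimal places, while digits in signif() counts significant digits.

```r
round(3.14159, 3)             # 3.142: arguments can also be given by position
round(314.159, digits = 1)    # 314.2: one decimal place
signif(314.159, digits = 2)   # 310: two significant digits
ceiling(3.14159)              # 4: always rounds up
floor(3.14159)                # 3: always rounds down
```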

Practice:

Another commonly used base function is mean(). Use this function to calculate an average for the specCounts vector, and show your result to the instructor. (If you look at the help file, you will see that the arguments for the mean() function are supplied in a different data structure than the other functions we’ve seen so far.)


Data

The last thing we’re going to cover in this introduction is how to inspect data.

Selecting data using indexes and sequences

When analyzing data, we often want to partition the data so that we are only working with selected columns or rows. A data frame or data matrix is simply a collection of vectors combined together. So let’s begin with vectors and how to access different elements, and then extend those concepts to dataframes.

Vectors

If we want to extract one or several values from a vector, we must provide one or several indexes using square brackets [ ] syntax. The index represents the location of the element within a vector (or the compartment number, if you think of the bucket analogy). R indexes start at 1.

Let’s start by creating a vector called age:

age  <-  c(15, 22, 45, 52, 73, 81)

Suppose we only wanted the second value of this vector. We would use the following syntax:

age[2]
## [1] 22

If we wanted all values except the second value of this vector, we would use the following:

age[-2]
## [1] 15 45 52 73 81

If we wanted to select more than one element we would still use the square bracket syntax, but rather than using a single value we would pass in a vector of several index values:

idx  <-  c(3,5,6) # create vector of the elements of interest
age[idx]
## [1] 45 73 81

To select a sequence of consecutive values from a vector, we would use : which is a special operator that creates numeric vectors of integers in increasing or decreasing order. Let’s select the first four values from age:

age[1:4]
## [1] 15 22 45 52

Practice: Try reversing that to say 4:1 and see what happens!

Selection of values can also be performed using logical expressions. Logical operators include greater than (>), less than (<), and equal to (==). We can use logical expressions to determine whether a particular condition is true or false for each element, and then keep only the elements for which it is TRUE:

age[age > 50]
## [1] 52 73 81
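
It can help to look at the logical vector on its own before using it to subset. The sketch below does that, and also uses the base R function which() and the & ("and") operator, which have not been introduced above; the name over50 is made up for illustration.

```r
age <- c(15, 22, 45, 52, 73, 81)

over50 <- age > 50          # one TRUE or FALSE per element
over50                      # FALSE FALSE FALSE TRUE TRUE TRUE
age[over50]                 # 52 73 81: same result as age[age > 50]

which(age > 50)             # 4 5 6: the positions of the TRUE values
age[age > 20 & age < 75]    # 22 45 52 73: both conditions must hold
age[age == 45]              # 45
```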

More details about using logical expressions to subset data can be found here

Dataframes

We’re going to use the built-in data set called iris. This single dataframe contains the measurements in centimeters of the variables sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris, a total of 150 specimens. The species are Iris setosa, I. versicolor, and I. virginica.
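
You can check this description against the data set itself. The following sketch uses the base R functions names(), levels(), and table(), along with the $ operator for selecting a column, to list the columns and count the specimens per species.

```r
names(iris)            # the five column names
levels(iris$Species)   # "setosa" "versicolor" "virginica"
table(iris$Species)    # 50 specimens for each of the 3 species
```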

Inspecting data frames

This is a small dataframe, so you can just look at it in the console first.

iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Or check how many rows and columns it has with dim():

dim(iris)
## [1] 150   5
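If you only need one of those two numbers, here is a minimal sketch using the base R functions nrow() and ncol(). The variable names n_rows and n_cols are just for illustration:

```r
n_rows <- nrow(iris)   # number of rows (one per specimen)
n_cols <- ncol(iris)   # number of columns (one per variable)
n_rows                 # 150
n_cols                 # 5
c(n_rows, n_cols)      # the same two numbers that dim(iris) returns
```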

However, 150 lines is still a little inconvenient if you just want to see what the data in each column are generally like. Try this:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now you see just the first 6 lines, as well as the header (column names). Each row holds information for a single specimen, and the columns contain information about the specimen’s measurements and species. What data type is each column? Check using str(), which we used before to inspect the arguments of a function. When you call it on a variable, it tells you about the data structure and types.

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

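Uh-oh, what’s a factor data type? It is a categorical variable: it looks like a character field, but each value must come from a restricted set of possible values called “levels” (here, the three species names). Don’t worry about the details for now.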

(You can also look at this in a separate tab in RStudio. Within the “Environment” panel, choose “package:datasets” from the dropdown that currently says “Global Environment”. Then click on iris in the Environment tab to open the data table in a new tab in the same pane as the script editor.)
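If you are curious about the factor column anyway, the short sketch below pokes at it with a few base R functions. It is only an illustration, and the variable names species_codes and species_counts are made up for this example. The 1 1 1 ... that str() printed for Species are the positions of each value within the levels:

```r
class(iris$Species)                      # "factor"
levels(iris$Species)                     # the three allowed values
# Rows 1, 51 and 101 are the first specimen of each species
species_codes <- as.integer(iris$Species[c(1, 51, 101)])
species_codes                            # 1 2 3: position of each value within the levels
species_counts <- table(iris$Species)    # number of rows for each level
species_counts                           # 50 of each species
```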

Selecting data from dataframes

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from them we need to specify an index for each dimension. We use the same square bracket notation, but rather than providing a single index, we provide two. Within the square brackets, row numbers come first followed by column numbers, and the two are separated by a comma; i.e., dataframe[row,column]

iris[1, 1]   # element from the first row in the first column of the data frame
## [1] 5.1
iris[1, 3]   # element from the first row in the 3rd column
## [1] 1.4

To select whole rows, you provide only the index for the rows and leave the columns index blank. The key here is to include the comma, to let R know that you are accessing a 2-dimensional data structure:

iris[3, ]    #returns a one-row dataframe containing all elements in the 3rd row
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 3          4.7         3.2          1.3         0.2  setosa
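A quick way to check what kind of object a row selection gives you is to ask R directly. This is a small sketch; the name third_row is just for illustration:

```r
third_row <- iris[3, ]     # select the whole 3rd row
class(third_row)           # "data.frame": a row can mix numbers and factors, so it stays a data frame
dim(third_row)             # 1 5: one row, five columns
third_row$Petal.Length     # 1.3: individual values are still reachable by column
```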

To select whole columns instead, leave the row index blank:

iris[ , 3]    #returns a vector containing all elements in the 3rd column
##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
##  [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
##  [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
##  [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
##  [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
##  [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1
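Selecting a single column this way simplifies the result to a plain vector. If you would rather keep a one-column dataframe, the bracket operator accepts a drop = FALSE argument. A minimal sketch (the names petal_vec and petal_df are made up for this example):

```r
petal_vec <- iris[ , 3]                 # simplified to a numeric vector
petal_df  <- iris[ , 3, drop = FALSE]   # kept as a dataframe with one column

class(petal_vec)     # "numeric"
length(petal_vec)    # 150
class(petal_df)      # "data.frame"
dim(petal_df)        # 150 1
colnames(petal_df)   # "Petal.Length"
```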

Just like with vectors, you can select multiple rows and columns at a time. Within the square brackets, you need to provide a vector of the desired values:

iris[ , 1:2] #returns a dataframe containing first two columns
##     Sepal.Length Sepal.Width
## 1            5.1         3.5
## 2            4.9         3.0
## 3            4.7         3.2
## 4            4.6         3.1
## 5            5.0         3.6
## 6            5.4         3.9
## 7            4.6         3.4
## 8            5.0         3.4
## 9            4.4         2.9
## 10           4.9         3.1
## 11           5.4         3.7
## 12           4.8         3.4
## 13           4.8         3.0
## 14           4.3         3.0
## 15           5.8         4.0
## 16           5.7         4.4
## 17           5.4         3.9
## 18           5.1         3.5
## 19           5.7         3.8
## 20           5.1         3.8
## 21           5.4         3.4
## 22           5.1         3.7
## 23           4.6         3.6
## 24           5.1         3.3
## 25           4.8         3.4
## 26           5.0         3.0
## 27           5.0         3.4
## 28           5.2         3.5
## 29           5.2         3.4
## 30           4.7         3.2
## 31           4.8         3.1
## 32           5.4         3.4
## 33           5.2         4.1
## 34           5.5         4.2
## 35           4.9         3.1
## 36           5.0         3.2
## 37           5.5         3.5
## 38           4.9         3.6
## 39           4.4         3.0
## 40           5.1         3.4
## 41           5.0         3.5
## 42           4.5         2.3
## 43           4.4         3.2
## 44           5.0         3.5
## 45           5.1         3.8
## 46           4.8         3.0
## 47           5.1         3.8
## 48           4.6         3.2
## 49           5.3         3.7
## 50           5.0         3.3
## 51           7.0         3.2
## 52           6.4         3.2
## 53           6.9         3.1
## 54           5.5         2.3
## 55           6.5         2.8
## 56           5.7         2.8
## 57           6.3         3.3
## 58           4.9         2.4
## 59           6.6         2.9
## 60           5.2         2.7
## 61           5.0         2.0
## 62           5.9         3.0
## 63           6.0         2.2
## 64           6.1         2.9
## 65           5.6         2.9
## 66           6.7         3.1
## 67           5.6         3.0
## 68           5.8         2.7
## 69           6.2         2.2
## 70           5.6         2.5
## 71           5.9         3.2
## 72           6.1         2.8
## 73           6.3         2.5
## 74           6.1         2.8
## 75           6.4         2.9
## 76           6.6         3.0
## 77           6.8         2.8
## 78           6.7         3.0
## 79           6.0         2.9
## 80           5.7         2.6
## 81           5.5         2.4
## 82           5.5         2.4
## 83           5.8         2.7
## 84           6.0         2.7
## 85           5.4         3.0
## 86           6.0         3.4
## 87           6.7         3.1
## 88           6.3         2.3
## 89           5.6         3.0
## 90           5.5         2.5
## 91           5.5         2.6
## 92           6.1         3.0
## 93           5.8         2.6
## 94           5.0         2.3
## 95           5.6         2.7
## 96           5.7         3.0
## 97           5.7         2.9
## 98           6.2         2.9
## 99           5.1         2.5
## 100          5.7         2.8
## 101          6.3         3.3
## 102          5.8         2.7
## 103          7.1         3.0
## 104          6.3         2.9
## 105          6.5         3.0
## 106          7.6         3.0
## 107          4.9         2.5
## 108          7.3         2.9
## 109          6.7         2.5
## 110          7.2         3.6
## 111          6.5         3.2
## 112          6.4         2.7
## 113          6.8         3.0
## 114          5.7         2.5
## 115          5.8         2.8
## 116          6.4         3.2
## 117          6.5         3.0
## 118          7.7         3.8
## 119          7.7         2.6
## 120          6.0         2.2
## 121          6.9         3.2
## 122          5.6         2.8
## 123          7.7         2.8
## 124          6.3         2.7
## 125          6.7         3.3
## 126          7.2         3.2
## 127          6.2         2.8
## 128          6.1         3.0
## 129          6.4         2.8
## 130          7.2         3.0
## 131          7.4         2.8
## 132          7.9         3.8
## 133          6.4         2.8
## 134          6.3         2.8
## 135          6.1         2.6
## 136          7.7         3.0
## 137          6.3         3.4
## 138          6.4         3.1
## 139          6.0         3.0
## 140          6.9         3.1
## 141          6.7         3.1
## 142          6.9         3.1
## 143          5.8         2.7
## 144          6.8         3.2
## 145          6.7         3.3
## 146          6.7         3.0
## 147          6.3         2.5
## 148          6.5         3.0
## 149          6.2         3.4
## 150          5.9         3.0
iris[c(1,3,6), ] #returns a dataframe containing first, third and sixth rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
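Row and column vectors can also be combined inside the same pair of brackets. As a small sketch (the name small_iris is just for illustration), this keeps three rows and two columns at once:

```r
# rows 1, 3 and 6; columns 1 and 2
small_iris <- iris[c(1, 3, 6), 1:2]
small_iris
dim(small_iris)        # 3 2
colnames(small_iris)   # "Sepal.Length" "Sepal.Width"
```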

For larger datasets, it can be tricky to remember which column number corresponds to a particular variable. In some cases, the column number for a variable can change if the script you are using adds or removes columns. It’s therefore often better to use column names to refer to a particular variable, and it makes your code easier to read and your intentions clearer.

iris[1:3 , "Petal.Length"] # values of the Petal.Length column from the first three rows/samples.
## [1] 1.4 1.4 1.3
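To convince yourself that a name and a number point at the same column, you can compare the two selections. This sketch uses the base R functions which() and identical(); the variable names are made up for this example:

```r
# Look up the position of a column from its name
petal_length_position <- which(colnames(iris) == "Petal.Length")
petal_length_position                  # 3

by_name   <- iris[1:3, "Petal.Length"]
by_number <- iris[1:3, 3]
identical(by_name, by_number)          # TRUE: both select the same values
```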

You can also select a particular column, and do operations on it, using the $ sign. In this case, the entire column is a vector. For instance, to extract all the species names from our dataset, we can use:

iris$Species
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

You can use names() or colnames() to remind yourself of the column names. Because a column selected with $ is a vector, we can then supply index values to select specific values from it. For example, if we wanted the petal widths for the first five samples in iris:

colnames(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"
iris$Petal.Width[1:5]
## [1] 0.2 0.2 0.2 0.2 0.2
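Because a column selected with $ is an ordinary vector, any function that works on vectors can be applied to it. A short sketch using base R summary functions (the variable names are just for illustration):

```r
width_mean  <- mean(iris$Petal.Width)    # average petal width over all 150 specimens
width_range <- range(iris$Petal.Width)   # smallest and largest value

round(width_mean, 2)   # 1.2
width_range            # 0.1 2.5

# A comparison gives one TRUE/FALSE per row; sum() counts the TRUEs
wide_count <- sum(iris$Petal.Width > 2)
wide_count
```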

The $ allows you to select a single column by name, which is a one-dimensional vector that requires only one index and no commas. To select multiple columns by name, you need to make a vector of strings that correspond to column names and supply it as the column index inside the square brackets:

iris[, c("Petal.Length", "Petal.Width")]
##     Petal.Length Petal.Width
## 1            1.4         0.2
## 2            1.4         0.2
## 3            1.3         0.2
## 4            1.5         0.2
## 5            1.4         0.2
## 6            1.7         0.4
## 7            1.4         0.3
## 8            1.5         0.2
## 9            1.4         0.2
## 10           1.5         0.1
## 11           1.5         0.2
## 12           1.6         0.2
## 13           1.4         0.1
## 14           1.1         0.1
## 15           1.2         0.2
## 16           1.5         0.4
## 17           1.3         0.4
## 18           1.4         0.3
## 19           1.7         0.3
## 20           1.5         0.3
## 21           1.7         0.2
## 22           1.5         0.4
## 23           1.0         0.2
## 24           1.7         0.5
## 25           1.9         0.2
## 26           1.6         0.2
## 27           1.6         0.4
## 28           1.5         0.2
## 29           1.4         0.2
## 30           1.6         0.2
## 31           1.6         0.2
## 32           1.5         0.4
## 33           1.5         0.1
## 34           1.4         0.2
## 35           1.5         0.2
## 36           1.2         0.2
## 37           1.3         0.2
## 38           1.4         0.1
## 39           1.3         0.2
## 40           1.5         0.2
## 41           1.3         0.3
## 42           1.3         0.3
## 43           1.3         0.2
## 44           1.6         0.6
## 45           1.9         0.4
## 46           1.4         0.3
## 47           1.6         0.2
## 48           1.4         0.2
## 49           1.5         0.2
## 50           1.4         0.2
## 51           4.7         1.4
## 52           4.5         1.5
## 53           4.9         1.5
## 54           4.0         1.3
## 55           4.6         1.5
## 56           4.5         1.3
## 57           4.7         1.6
## 58           3.3         1.0
## 59           4.6         1.3
## 60           3.9         1.4
## 61           3.5         1.0
## 62           4.2         1.5
## 63           4.0         1.0
## 64           4.7         1.4
## 65           3.6         1.3
## 66           4.4         1.4
## 67           4.5         1.5
## 68           4.1         1.0
## 69           4.5         1.5
## 70           3.9         1.1
## 71           4.8         1.8
## 72           4.0         1.3
## 73           4.9         1.5
## 74           4.7         1.2
## 75           4.3         1.3
## 76           4.4         1.4
## 77           4.8         1.4
## 78           5.0         1.7
## 79           4.5         1.5
## 80           3.5         1.0
## 81           3.8         1.1
## 82           3.7         1.0
## 83           3.9         1.2
## 84           5.1         1.6
## 85           4.5         1.5
## 86           4.5         1.6
## 87           4.7         1.5
## 88           4.4         1.3
## 89           4.1         1.3
## 90           4.0         1.3
## 91           4.4         1.2
## 92           4.6         1.4
## 93           4.0         1.2
## 94           3.3         1.0
## 95           4.2         1.3
## 96           4.2         1.2
## 97           4.2         1.3
## 98           4.3         1.3
## 99           3.0         1.1
## 100          4.1         1.3
## 101          6.0         2.5
## 102          5.1         1.9
## 103          5.9         2.1
## 104          5.6         1.8
## 105          5.8         2.2
## 106          6.6         2.1
## 107          4.5         1.7
## 108          6.3         1.8
## 109          5.8         1.8
## 110          6.1         2.5
## 111          5.1         2.0
## 112          5.3         1.9
## 113          5.5         2.1
## 114          5.0         2.0
## 115          5.1         2.4
## 116          5.3         2.3
## 117          5.5         1.8
## 118          6.7         2.2
## 119          6.9         2.3
## 120          5.0         1.5
## 121          5.7         2.3
## 122          4.9         2.0
## 123          6.7         2.0
## 124          4.9         1.8
## 125          5.7         2.1
## 126          6.0         1.8
## 127          4.8         1.8
## 128          4.9         1.8
## 129          5.6         2.1
## 130          5.8         1.6
## 131          6.1         1.9
## 132          6.4         2.0
## 133          5.6         2.2
## 134          5.1         1.5
## 135          5.6         1.4
## 136          6.1         2.3
## 137          5.6         2.4
## 138          5.5         1.8
## 139          4.8         1.8
## 140          5.4         2.1
## 141          5.6         2.4
## 142          5.1         2.3
## 143          5.1         1.9
## 144          5.9         2.3
## 145          5.7         2.5
## 146          5.2         2.3
## 147          5.0         1.9
## 148          5.2         2.0
## 149          5.4         2.3
## 150          5.1         1.8

While there is no equivalent $ syntax to select a row by name, you can select specific rows using the row names (in this case just numbers).

rownames(iris)
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
##  [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
##  [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
##  [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
##  [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
##  [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
##  [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
##  [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
##  [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [109] "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120"
## [121] "121" "122" "123" "124" "125" "126" "127" "128" "129" "130" "131" "132"
## [133] "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144"
## [145] "145" "146" "147" "148" "149" "150"
iris[c("100", "150"),]
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 100          5.7         2.8          4.1         1.3 versicolor
## 150          5.9         3.0          5.1         1.8  virginica
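Row names become more useful when they are meaningful labels rather than numbers. The sketch below works on a small copy of the data so that iris itself is not changed; the labels plant_a, plant_b and plant_c are hypothetical and only there to illustrate the idea:

```r
labelled <- iris[1:3, 1:2]                                  # small copy: 3 rows, 2 columns
rownames(labelled) <- c("plant_a", "plant_b", "plant_c")    # hypothetical labels
labelled["plant_b", ]                                       # select a row by its label
labelled["plant_b", "Sepal.Length"]                         # 4.9
```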

Subsetting data

Another way of partitioning dataframes is using the subset() function to return the rows of the dataframe for which the logical expression is TRUE. This allows us to subset the data in a single step. The syntax for the subset() function is:

subset(dataframe, column_name == "value")

Any logical expression could replace the == "value" part. For example, we can look at the samples of the species setosa only:

subset(iris, Species == "setosa")
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa
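As an example of a different logical expression, conditions can be joined with & (and), and the select argument of subset() limits which columns come back. This is a minimal sketch; the variable names are made up for this example:

```r
# setosa specimens whose petals are wider than 0.4
wide_setosa <- subset(iris, Species == "setosa" & Petal.Width > 0.4)
nrow(wide_setosa)        # 2
rownames(wide_setosa)    # "24" "44"

# same species filter, but keep only the two petal columns
setosa_petals <- subset(iris, Species == "setosa",
                        select = c(Petal.Length, Petal.Width))
dim(setosa_petals)       # 50 2
```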

Practice:

Look at the results of the following commands.

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
mean(subset(iris,Species == "setosa")$Petal.Width)
## [1] 0.246
mean(subset(iris,Species == "versicolor")$Petal.Width)
## [1] 1.326
mean(subset(iris,Species == "virginica")$Petal.Width)
## [1] 2.026
  1. Which species has the widest petals? Examine the results to determine the answer.
  2. Examine the code in more detail. Here, we have strung together several different functions, and are applying those functions to different sets of data. Do you understand what each element of the command is doing?
  3. Which species has the longest petals? Does the same species have the widest and longest petals? Edit the script to find the answer.
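Once you understand the three mean(subset(...)) commands, here is one possible shortcut, offered as a sketch rather than the required approach. The base R function tapply() applies a function (here mean) to one column, grouped by the values of another. The name width_means is just for illustration:

```r
# mean petal width for each species, computed in one call
width_means <- tapply(iris$Petal.Width, iris$Species, mean)
width_means                       # setosa 0.246, versicolor 1.326, virginica 2.026
names(which.max(width_means))     # "virginica": the species with the largest mean
```

These are the same three numbers as the separate subset() commands above, so either approach can be adapted for the petal length question.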

Part 3: Species ranges from GBIF data

In this section, we will create a map that shows everywhere on earth that a species has been found. We each can choose our own species. The examples will be done using the species Morpho menelaus, the blue morpho butterfly. Wherever you see its name, you’ll substitute your own species’ name.

Step 1: Choose your favorite species. The function gbif requires the scientific name of the species. So, if you don’t know it, open up your web browser and search for the scientific name. Example search: “blue morpho butterfly scientific name”. Note that there are several different species, all of which have the common name “Blue morpho”, which is one reason it’s more precise to use the scientific names!

For this species, Morpho menelaus, Morpho is the genus and menelaus is called the specific epithet (species indicator).

Step 2: Once you’ve chosen a species, navigate to https://www.gbif.org/ within your browser. Type your species name into the search window, then click “Occurrences” along the top of the search bar. You may need to select ‘Yes’ to the question ‘Do you want to limit your search to this taxon only?’ on the top left of the results page. Examine the occurrences.

Step 3: Open a new script (File>New File>R Script). Hit Ctrl+S to save it with the name “species-range” or something like that. It will prompt you to save it in the test folder of your login, which is fine.

Downloading your data

You will use a function in the R package dismo to download your data. Paste the following into your new script file, substituting your species name, and run it by selecting it and hitting Ctrl+Enter or clicking “Run” in the top right corner of the script pane:

require(dismo)
## Loading required package: dismo
## Loading required package: raster
## Loading required package: sp
gbif('Morpho','menelaus',geo = FALSE, download = FALSE) #find out the total number of occurrences for this species in the database -- if this doesn't match the number of occurrences you see on the website, you should see if you typed something wrong!
## [1] 1870
raw_data <- gbif('Morpho','menelaus',geo = TRUE) #download all the occurrences with longitude and latitude data, which may not be all of them
## 1870 records found
## 0-300-600-900-1200-1500-1800-1870 records downloaded

Inspect the data you downloaded:

df <- raw_data #copy the GBIF download file into another data frame so we can start cleaning it. We do this so that we are not modifying the original data we downloaded.
dim(df) #GBIF returns a LOT of columns!
## [1] 1870  175
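With this many columns, it can help to check that the ones you care about are present before reading through the whole list. The sketch below uses a tiny made-up dataframe (toy_records, with hypothetical values) in place of the real download so that it runs without an internet connection. It assumes your download has coordinate columns named lon and lat; swap in df and your own column names when you try it:

```r
# Hypothetical stand-in for a GBIF download (values are invented)
toy_records <- data.frame(
  species = c("Morpho menelaus", "Morpho menelaus", "Morpho menelaus"),
  lon     = c(-60.0, NA, -52.3),
  lat     = c(-3.1, -10.5, 4.9)
)

# Which of the columns we want are actually present?
wanted <- c("lon", "lat", "country")
found  <- wanted %in% colnames(toy_records)
found                     # TRUE TRUE FALSE

# How many records have both a longitude and a latitude?
has_coords <- !is.na(toy_records$lon) & !is.na(toy_records$lat)
sum(has_coords)           # 2
```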

Look at the column names:

colnames(df)
##   [1] "acceptedNameUsage"                    
##   [2] "acceptedScientificName"               
##   [3] "acceptedTaxonKey"                     
##   [4] "accessRights"                         
##   [5] "adm1"                                 
##   [6] "adm2"                                 
##   [7] "associatedReferences"                 
##   [8] "associatedSequences"                  
##   [9] "basisOfRecord"                        
##  [10] "behavior"                             
##  [11] "bibliographicCitation"                
##  [12] "catalogNumber"                        
##  [13] "class"                                
##  [14] "classKey"                             
##  [15] "cloc"                                 
##  [16] "collectionCode"                       
##  [17] "collectionID"                         
##  [18] "collectionKey"                        
##  [19] "continent"                            
##  [20] "coordinatePrecision"                  
##  [21] "coordinateUncertaintyInMeters"        
##  [22] "country"                              
##  [23] "crawlId"                              
##  [24] "dataGeneralizations"                  
##  [25] "datasetID"                            
##  [26] "datasetKey"                           
##  [27] "datasetName"                          
##  [28] "dateIdentified"                       
##  [29] "day"                                  
##  [30] "depth"                                
##  [31] "depthAccuracy"                        
##  [32] "disposition"                          
##  [33] "distanceFromCentroidInMeters"         
##  [34] "dynamicProperties"                    
##  [35] "elevation"                            
##  [36] "elevationAccuracy"                    
##  [37] "endDayOfYear"                         
##  [38] "establishmentMeans"                   
##  [39] "eventDate"                            
##  [40] "eventID"                              
##  [41] "eventRemarks"                         
##  [42] "eventTime"                            
##  [43] "eventType"                            
##  [44] "family"                               
##  [45] "familyKey"                            
##  [46] "fieldNotes"                           
##  [47] "fieldNumber"                          
##  [48] "footprintSRS"                         
##  [49] "footprintWKT"                         
##  [50] "fullCountry"                          
##  [51] "gbifID"                               
##  [52] "gbifRegion"                           
##  [53] "genericName"                          
##  [54] "genus"                                
##  [55] "genusKey"                             
##  [56] "geodeticDatum"                        
##  [57] "georeferencedBy"                      
##  [58] "georeferencedDate"                    
##  [59] "georeferenceProtocol"                 
##  [60] "georeferenceRemarks"                  
##  [61] "georeferenceSources"                  
##  [62] "georeferenceVerificationStatus"       
##  [63] "habitat"                              
##  [64] "higherClassification"                 
##  [65] "higherGeography"                      
##  [66] "higherGeographyID"                    
##  [67] "hostingOrganizationKey"               
##  [68] "http://unknown.org/captive_cultivated"
##  [69] "http://unknown.org/language"          
##  [70] "http://unknown.org/modified"          
##  [71] "http://unknown.org/nick"              
##  [72] "http://unknown.org/orders"            
##  [73] "http://unknown.org/recordEnteredBy"   
##  [74] "http://unknown.org/recordID"          
##  [75] "identificationID"                     
##  [76] "identificationReferences"             
##  [77] "identificationRemarks"                
##  [78] "identificationVerificationStatus"     
##  [79] "identifiedBy"                         
##  [80] "identifier"                           
##  [81] "individualCount"                      
##  [82] "informationWithheld"                  
##  [83] "infraspecificEpithet"                 
##  [84] "installationKey"                      
##  [85] "institutionCode"                      
##  [86] "institutionID"                        
##  [87] "institutionKey"                       
##  [88] "isInCluster"                          
##  [89] "ISO2"                                 
##  [90] "isSequenced"                          
##  [91] "iucnRedListCategory"                  
##  [92] "key"                                  
##  [93] "kingdom"                              
##  [94] "kingdomKey"                           
##  [95] "language"                             
##  [96] "lastCrawled"                          
##  [97] "lastInterpreted"                      
##  [98] "lastParsed"                           
##  [99] "lat"                                  
## [100] "license"                              
## [101] "lifeStage"                            
## [102] "locality"                             
## [103] "locationID"                           
## [104] "locationRemarks"                      
## [105] "lon"                                  
## [106] "materialEntityID"                     
## [107] "modified"                             
## [108] "month"                                
## [109] "municipality"                         
## [110] "nameAccordingTo"                      
## [111] "nomenclaturalCode"                    
## [112] "occurrenceID"                         
## [113] "occurrenceRemarks"                    
## [114] "occurrenceStatus"                     
## [115] "order"                                
## [116] "orderKey"                             
## [117] "organismID"                           
## [118] "organismQuantity"                     
## [119] "organismQuantityType"                 
## [120] "originalNameUsage"                    
## [121] "otherCatalogNumbers"                  
## [122] "ownerInstitutionCode"                 
## [123] "parentNameUsage"                      
## [124] "phylum"                               
## [125] "phylumKey"                            
## [126] "preparations"                         
## [127] "previousIdentifications"              
## [128] "programmeAcronym"                     
## [129] "projectId"                            
## [130] "protocol"                             
## [131] "publishedByGbifRegion"                
## [132] "publishingCountry"                    
## [133] "publishingOrgKey"                     
## [134] "recordedBy"                           
## [135] "recordNumber"                         
## [136] "references"                           
## [137] "reproductiveCondition"                
## [138] "rights"                               
## [139] "rightsHolder"                         
## [140] "sampleSizeUnit"                       
## [141] "sampleSizeValue"                      
## [142] "samplingEffort"                       
## [143] "samplingProtocol"                     
## [144] "scientificName"                       
## [145] "scientificNameID"                     
## [146] "sex"                                  
## [147] "species"                              
## [148] "speciesKey"                           
## [149] "specificEpithet"                      
## [150] "startDayOfYear"                       
## [151] "subfamily"                            
## [152] "superfamily"                          
## [153] "taxonConceptID"                       
## [154] "taxonID"                              
## [155] "taxonKey"                             
## [156] "taxonomicStatus"                      
## [157] "taxonRank"                            
## [158] "taxonRemarks"                         
## [159] "tribe"                                
## [160] "type"                                 
## [161] "typeStatus"                           
## [162] "typifiedName"                         
## [163] "verbatimCoordinateSystem"             
## [164] "verbatimElevation"                    
## [165] "verbatimEventDate"                    
## [166] "verbatimIdentification"               
## [167] "verbatimLabel"                        
## [168] "verbatimLocality"                     
## [169] "verbatimSRS"                          
## [170] "verbatimTaxonRank"                    
## [171] "vernacularName"                       
## [172] "vitality"                             
## [173] "waterBody"                            
## [174] "year"                                 
## [175] "downloadDate"

Look at some of the fields for the first six rows:

head(df)[,c("species","continent","country","adm1","lat","lon")]
##           species     continent    country           adm1        lat       lon
## 1 Morpho menelaus SOUTH_AMERICA     Brazil Rio de Janeiro -22.421437 -42.72357
## 2 Morpho menelaus NORTH_AMERICA Costa Rica     Puntarenas   8.619720 -83.47618
## 3 Morpho menelaus SOUTH_AMERICA    Ecuador           Napo  -0.946811 -77.86990
## 4 Morpho menelaus SOUTH_AMERICA     Brazil       Rondônia  -9.879575 -62.83085
## 5 Morpho menelaus SOUTH_AMERICA       Peru  Madre de Dios -12.225983 -69.11453
## 6 Morpho menelaus SOUTH_AMERICA     Brazil Espírito Santo -19.066258 -40.14829

Data cleaning

Even though we specified geo=TRUE in our download, not all the occurrences are associated with exact coordinates. You can see this by examining the values in the ‘lat’ column - some are numbers, some are “NA”.

df$lat
##    [1] -22.421437   8.619720  -0.946811  -9.879575 -12.225983 -19.066258
##    [7]  -3.780987  -6.064480  -5.351419  -5.333967  -5.334458  -5.334078
##   [13]  -5.365805  -5.365753  -5.399935  -5.367108 -19.500424 -21.927737
##   [19]   4.818926 -19.500252   4.559838   1.181890   4.845483   4.846287
##   [25]   4.559448   4.614516   4.850600   4.852784   4.850377   4.816111
##   [31]   4.850355 -19.894333   4.324474 -12.569283   4.846330   4.846330
##   [37]   4.846492   4.747376 -20.344616   4.937918   4.711524  -9.303196
##   [43]   3.611642   4.583960   8.644759 -13.809266 -19.384084   1.263482
##   [49] -20.237344 -13.748128 -19.151444 -13.517420   1.191305   8.621288
##   [55]  -4.542322   9.811421   9.579149   9.600211   9.600211   9.600211
##   [61]   9.264426   8.514605  -7.130298   5.520283   4.951519 -20.122749
##   [67]  -9.594988  -9.595508   4.861367 -23.487708 -23.300117 -20.256110
##   [73]   4.884912  -9.596210  -1.471529 -14.133985 -19.564458 -10.548402
##   [79]   3.862786 -22.482838  -6.131725 -21.860191  -0.492308   0.297361
##   [85]  -4.953959  -4.953959 -14.051504  -2.690120  -9.595911 -15.797515
##   [91]  -9.201110  -8.497624  -3.249010  -0.947220   9.380656  -0.644075
##   [97]   4.749904  -6.078563   9.379315   8.653168   8.621111   8.594728
##  [103]   8.311056   8.962379   8.962379   9.128991   9.128991   9.117710
##  [109]   8.629521   8.640116   8.640116   8.629521   1.078775 -22.586237
##  [115]  51.215340 -19.884822   5.231796   0.430856 -12.436112  -5.956770
##  [121]   4.802023   4.802103   4.948531  -7.352031   3.606272   4.636310
##  [127]   8.618011   8.618011   9.128991 -15.564841 -22.584860  -6.062733
##  [133]  -1.196038   4.079598   4.277997   4.277997 -19.007645   4.705225
##  [139] -22.575515   3.285173  -2.552399  -5.981843  -7.105303  -2.651278
##  [145] -23.763152   4.931193   4.876801 -13.563002 -10.570198   5.124188
##  [151]  -1.292812  -7.118858  -7.115096 -16.225374   4.695541   1.102817
##  [157] -19.977343  -8.934522  -6.641455  -7.177139  -3.998387   9.166311
##  [163]   9.165129   9.166311   9.165129  -0.525955 -21.724566 -21.067681
##  [169] -12.896993   4.940300   9.372337   8.621080   8.637262   8.629564
##  [175]   9.120612   4.497143  -3.469904   4.171241 -19.154134   9.382699
##  [181]   9.382699  -3.118039  -6.069765  -6.176564 -23.014957   5.105623
##  [187]  -6.136080  -6.164994   5.475397   4.647578  -1.744070  -1.651950
##  [193]  -1.744710  -1.725640  -1.663760  -1.668910  -1.666190 -19.150469
##  [199]  -1.915662   3.933889  -4.292860  -8.779898 -19.151077   8.448322
##  [205]   8.448843   9.174080  -9.247283   4.392002   9.128487   8.641485
##  [211]  -3.007096  -6.166503  -6.170613   4.899158   5.073611 -12.520339
##  [217] -12.602535 -12.612850 -12.610083 -22.998989 -22.997564   2.985182
##  [223]   3.275450 -20.182722 -21.351531 -21.022766 -21.337994   4.637108
##  [229] -19.020479  12.560491 -17.141545 -17.084532   0.030507   5.806028
##  [235]  -0.639606 -20.989517  -0.639565   4.583895  -1.451277  -1.198754
##  [241]   4.880159   4.602963  -1.031409   4.873762  -5.943541  -1.007590
##  [247] -12.382073 -13.033549   4.170289 -13.033549   8.152850 -10.963043
##  [253] -27.195314   8.637262   8.621080  -0.464842  -0.469946   4.889988
##  [259]  -0.477259   4.898391 -15.865398  -9.598495 -23.434530  -9.597494
##  [265]   3.892480   4.161371  -9.597581   5.450330   0.033860   4.867750
##  [271]   4.339475   4.344530   5.496960  -0.520903  -4.240269  -1.104396
##  [277]   3.621452   8.478589   9.134002   4.956705 -12.535989  -9.959420
##  [283] -25.616750  -0.046257  -1.462441 -15.733222  -0.996406  -6.075373
##  [289]  -0.993889  -0.993612  -0.994270  -0.996315  -0.990805  -0.991284
##  [295]  -0.991295  -0.991257   5.070786   5.316863  -5.994185  -2.009421
##  [301]   0.046925   0.046925   0.046925  -2.801244  -2.817921   4.724493
##  [307]   4.724670   4.806228 -12.330908  -1.072632 -12.600330  -0.614629
##  [313]  -5.674653  51.211800  51.211800  -3.438712   4.831168  -9.019088
##  [319]   4.637630  -5.365989  -5.365938  -5.365961  -5.333961  -5.366143
##  [325]  -5.366143  -5.366107  -5.366038  -5.364730  -5.364427  -5.364773
##  [331]  -5.371189  -5.366143  -5.366143   4.943200  -0.527644   0.133861
##  [337]   0.654024   0.048250   0.048250   0.142194   1.123611   3.900480
##  [343]   3.900480  -6.832741   0.138806   1.287806   1.285639   8.392663
##  [349]  -9.955906  -9.955156  -9.955663  -9.954633  -1.429993   6.372135
##  [355] -22.436482 -22.436481 -15.873367   5.348903   5.348903 -14.096583
##  [361] -12.568718  -4.161945  -9.597507   7.301117   7.350550  -9.597507
##  [367]  -0.674358  -1.429216 -11.854378   5.348903   5.367658  -2.812197
##  [373]   5.348903  -9.775909  -9.583139  -9.246550   5.348903 -15.735737
##  [379] -15.629070  -9.417660 -19.329356   5.290370 -13.534361 -17.354317
##  [385]  -9.645191 -13.534361  -9.756327 -15.442202 -13.534361  -9.435092
##  [391] -15.729922 -15.729922 -15.733955  -4.248655  -1.065085  -3.007447
##  [397] -15.464492 -23.999964  10.717817   8.664617   8.664617   8.664617
##  [403]   8.664617  10.418362 -19.890869 -22.603365   4.886644 -11.217304
##  [409] -12.913517   4.857988   4.857988  -9.978787  -9.597200  -0.616315
##  [415] -20.459327   1.123611   1.123611   1.123611   1.123611   1.123611
##  [421]   1.123611   1.123611   1.123611 -12.225920 -13.519565  -5.987587
##  [427]   0.970882   4.284750   5.462587  -1.115459   4.930000   4.245448
##  [433] -12.337929 -19.981210  -0.676227 -16.665542 -15.865125 -22.586237
##  [439] -22.586237   3.801169   3.801287   5.757319  -3.271340  -3.249000
##  [445]   5.255068  -7.146626  10.412242  -0.674358 -15.867350 -11.240289
##  [451] -15.794639 -11.464624   9.390973 -21.790704   7.119333   7.119333
##  [457]   7.131167   7.130556   7.130556   7.131167   7.169639   3.753938
##  [463]   4.207365   3.960829   9.679075   3.844282   3.596610   2.234540
##  [469]   4.961593  -9.597495  -8.477452   4.600000   4.628000   4.600000
##  [475]   4.600000  10.409167 -12.679130 -12.679656 -12.679656  -1.051367
##  [481]  -9.327545 -16.056372   4.143682 -15.938923  -0.438419   9.119982
##  [487]  10.415557  10.419444  10.417500   4.945597   4.550672  10.416556
##  [493]  -9.597507  -9.597507 -20.760286  -9.597601  -9.597601  -9.597601
##  [499]   5.321280 -10.877051  -0.638063  -0.638063  -0.434600  10.421667
##  [505]  10.978489  10.978489         NA   5.147153  10.419444   4.890723
##  [511] -20.124211   9.154720  10.409444 -16.527353   9.657446   4.559955
##  [517]   4.554902  10.408611   1.259297  -1.046389 -12.615212   6.380780
##  [523] -12.607466   3.282412         NA  10.421667         NA   4.089000
##  [529]   4.089000  10.408611   4.828579  -2.448518  -8.041215 -13.540703
##  [535]   4.831660  10.408611  -2.541381  -2.541381  -2.541381  -9.597507
##  [541]  -9.597507   5.949730   4.552620   4.552620   4.552620   4.552620
##  [547]   4.552620   4.552620  10.417500  10.420456  10.416556   3.292210
##  [553]  -0.253300 -22.968072 -20.239484 -20.308725         NA         NA
##  [559]         NA         NA         NA         NA         NA   9.924149
##  [565]   9.974529   4.038000   4.098000   4.098000   4.098310   4.038000
##  [571]   4.565141  10.420833  10.419444  10.409667  10.420556   4.552620
##  [577]   4.552620   4.552620   4.552620   4.552620   4.552620   8.658924
##  [583]   8.658924 -12.957540   8.689572   8.689572  -1.708100  -1.779583
##  [589]   9.203782   8.656437         NA  10.420556   9.925414   9.546111
##  [595]   9.546111 -22.966849   9.925130   9.925414   9.924149   8.658924
##  [601]   9.928085  -1.759450  -1.726433  -1.702583  -9.597261   9.925130
##  [607]  -1.708100  -1.731850  10.201667 -20.305196   9.154720  10.409444
##  [613]  10.410278   9.243181         NA   9.925130  -1.752233  10.416556
##  [619]  10.417164   4.098723  10.416556         NA  -1.733383  -1.756183
##  [625]  -1.718550         NA  -1.711083  -1.786183  -1.706167  -1.702583
##  [631]  -1.725550  -1.748517  -1.705950  -1.730033  -1.718550  -1.759450
##  [637]  -1.732633  -1.706167  -1.723733  -1.719567 -12.603419   9.154720
##  [643]  -1.703400  -1.752233  -1.725200  -1.723900   9.571849  -1.705950
##  [649]   4.602220  -1.706167  -1.730033  -1.725550  -1.734767  -1.723900
##  [655]  -1.718550   4.558889   4.558889  10.417500   8.658924  10.417500
##  [661]  -1.703400  -1.723733  -1.723900  -1.737917  -1.706167  -1.708400
##  [667]  -1.705950  -1.706167  -1.706167  -1.782433  -1.725200 -19.153516
##  [673]   0.803056  -1.727433  -1.723900  -1.718550   8.536614  -1.719567
##  [679]  10.408611  10.410250  -1.706167  -4.005833  -1.734767  -1.748517
##  [685]  -1.756183  -1.756183  -1.777133  -1.752233  -1.730967  -1.755933
##  [691]  -1.748517  -1.775650   8.560906         NA  -1.706167  -1.737917
##  [697]  -1.786183  -1.780983  -1.703400  -1.727583  -1.728467         NA
##  [703]         NA  -1.727150  -1.722283  -1.705950  -1.759450  -1.721067
##  [709]  -1.711267  -1.734167  -1.780983  -1.706167  -1.702583  -6.069710
##  [715]  -6.069710         NA  -1.706167  -1.759450  -1.723900  -1.721067
##  [721]  -1.708400  -1.756183  -1.719567  -1.723733   5.707222   5.707222
##  [727]   5.707222  10.416556  10.416556  -1.708400  -1.706167  -1.708100
##  [733]  -1.706167  -1.727433  -1.756183  -1.759450   8.995561   8.995561
##  [739]  -1.722283 -23.750000  -1.727433  -1.756183  -1.721067  -1.711267
##  [745]  -1.727150  -1.728467   9.011023   9.011023  -1.727150  -1.784550
##  [751]  -1.708400  -1.703400  -1.703400  -1.723900  10.417500  -1.728467
##  [757]  -1.725550  10.416556 -23.750481   8.649823   8.649823   8.649823
##  [763]         NA         NA         NA         NA         NA         NA
##  [769]         NA         NA   9.778326         NA         NA         NA
##  [775]         NA         NA         NA         NA   8.405740   4.600000
##  [781]  -1.777133  10.400000  -5.066000  -5.066000  -3.800000  -3.800000
##  [787]         NA  10.409722         NA -17.351586         NA         NA
##  [793]         NA         NA         NA         NA         NA         NA
##  [799]         NA         NA         NA         NA         NA         NA
##  [805]         NA         NA  10.902333  10.902333  10.902333         NA
##  [811]         NA         NA         NA         NA         NA         NA
##  [817]         NA         NA         NA         NA         NA         NA
##  [823]  10.409722  10.409722         NA         NA         NA         NA
##  [829] -23.433782   9.657730   4.187754         NA -10.298600         NA
##  [835]   9.388318   4.600000   4.600000  -4.585533  -4.585533 -26.300000
##  [841]         NA         NA   4.600000 -10.298600         NA   8.680656
##  [847]         NA   8.356504         NA         NA   4.600000   4.600000
##  [853] -14.556196         NA         NA         NA         NA   8.625444
##  [859]  -0.384444         NA         NA         NA         NA         NA
##  [865]         NA         NA         NA         NA   4.490934   4.490934
##  [871]         NA         NA  10.883267         NA   1.267222         NA
##  [877]         NA         NA         NA         NA         NA         NA
##  [883]         NA         NA         NA         NA         NA         NA
##  [889]         NA         NA         NA         NA         NA         NA
##  [895]         NA         NA         NA         NA         NA         NA
##  [901]         NA         NA         NA         NA         NA   9.675378
##  [907]   9.675378  -3.784611  -0.490695   6.630806         NA         NA
##  [913]         NA         NA   9.675378         NA         NA         NA
##  [919]         NA         NA -10.340972 -10.340972  -9.906111         NA
##  [925]         NA         NA         NA         NA         NA   2.586896
##  [931]   2.586896   2.586896         NA         NA   9.671765         NA
##  [937]         NA         NA         NA         NA   8.480171   8.480171
##  [943] -10.819120 -10.819120   4.548917   4.548917   4.548917         NA
##  [949] -20.166667         NA   5.661111  10.992609  10.992609         NA
##  [955]  10.539549   8.480171   8.480171   8.480171   8.480171   4.660000
##  [961]   4.660000   4.660000   4.660000   4.660000   4.660000   4.660000
##  [967]   4.660000   4.880000   4.660000   4.660000   4.660000  10.992609
##  [973]         NA         NA         NA  -1.084083  -1.086528         NA
##  [979]         NA         NA         NA         NA   8.640794         NA
##  [985]         NA         NA         NA         NA         NA         NA
##  [991]         NA         NA         NA         NA         NA         NA
##  [997]         NA         NA         NA         NA         NA         NA
## [1003]         NA  -1.902056         NA         NA         NA         NA
## [1009]         NA         NA   5.405556   5.405556         NA         NA
## [1015]         NA         NA         NA   4.579976   4.579976  -1.902056
## [1021]         NA         NA         NA         NA         NA         NA
## [1027]         NA         NA         NA         NA         NA         NA
## [1033]         NA         NA         NA         NA         NA         NA
## [1039]         NA         NA         NA         NA         NA         NA
## [1045]         NA         NA         NA         NA         NA         NA
## [1051]         NA         NA         NA         NA         NA         NA
## [1057]         NA         NA         NA         NA         NA         NA
## [1063]         NA         NA         NA         NA         NA         NA
## [1069]         NA         NA         NA         NA         NA         NA
## [1075]         NA         NA         NA         NA         NA         NA
## [1081]         NA         NA         NA         NA         NA         NA
## [1087]         NA         NA         NA         NA         NA         NA
## [1093]         NA         NA         NA         NA         NA         NA
## [1099]         NA   8.480171         NA         NA         NA         NA
## [1105]         NA         NA         NA         NA         NA         NA
## [1111]         NA         NA         NA         NA         NA   5.661111
## [1117]         NA         NA         NA         NA         NA         NA
## [1123]         NA         NA         NA         NA         NA         NA
## [1129]         NA         NA         NA         NA         NA         NA
## [1135]         NA -10.000000 -12.838000         NA         NA         NA
## [1141]         NA         NA  -1.003821  -1.003821         NA         NA
## [1147]         NA         NA         NA         NA         NA         NA
## [1153]         NA         NA         NA         NA         NA         NA
## [1159]         NA         NA   8.480171   8.480171         NA   8.479267
## [1165]         NA         NA         NA         NA         NA         NA
## [1171]         NA         NA         NA         NA         NA         NA
## [1177]         NA         NA         NA   8.480171  -8.169850 -10.000000
## [1183]         NA         NA         NA -10.000000  -3.686371         NA
## [1189]         NA         NA         NA         NA         NA         NA
## [1195]         NA  -9.296795  -9.296795         NA         NA         NA
## [1201]         NA         NA         NA         NA         NA         NA
## [1207]   5.501527         NA  -1.901770  -9.297680         NA         NA
## [1213]         NA         NA   9.167817         NA         NA   2.387500
## [1219]         NA   5.536706   5.536706   5.536706   5.536706  -9.296795
## [1225]  -9.296795  -9.296795  -9.296795         NA         NA         NA
## [1231]  -7.146177  -7.146177  -7.146177  -7.146177  -7.146177  -7.146177
## [1237]  -7.146177         NA         NA  -9.296795  -9.296795  -9.296795
## [1243]         NA         NA  -3.789722         NA         NA   4.200000
## [1249]         NA         NA         NA         NA         NA   5.449722
## [1255]         NA -17.783296  -9.296795  -9.296795  -9.296795  -9.296795
## [1261]   5.319722   5.319722   5.416667   5.416667   4.200000 -10.417165
## [1267]         NA         NA  -1.908330         NA   9.156285   9.156285
## [1273]   9.156285   9.156285         NA  -1.908330  -1.908330         NA
## [1279]  -1.904858  -1.904858  -1.908330   4.993834   4.993834 -27.000000
## [1285]         NA  -2.062350         NA         NA -22.209167 -22.209167
## [1291]         NA         NA         NA         NA         NA         NA
## [1297]   1.267222         NA         NA         NA   5.633000         NA
## [1303]         NA         NA   5.787651         NA         NA -10.000000
## [1309]         NA         NA         NA         NA  -2.091750         NA
## [1315] -16.277100         NA         NA  -1.901770         NA  -1.901770
## [1321]         NA         NA         NA         NA  -9.379600 -16.183330
## [1327] -15.023050 -16.183330 -16.183330 -16.183330         NA  -3.866700
## [1333]         NA         NA         NA         NA -21.850471 -22.800466
## [1339]         NA         NA         NA         NA         NA         NA
## [1345]         NA         NA         NA         NA -23.183777         NA
## [1351]         NA         NA         NA         NA         NA         NA
## [1357]         NA         NA         NA         NA         NA         NA
## [1363]         NA         NA         NA         NA         NA         NA
## [1369] -10.304670  -9.300000         NA         NA         NA         NA
## [1375]         NA  -5.000000         NA         NA         NA         NA
## [1381]         NA         NA         NA         NA         NA         NA
## [1387]         NA         NA         NA         NA         NA         NA
## [1393]         NA         NA         NA         NA         NA         NA
## [1399]         NA         NA         NA         NA         NA         NA
## [1405]         NA         NA         NA         NA         NA         NA
## [1411]         NA         NA         NA         NA         NA         NA
## [1417]         NA         NA         NA         NA         NA         NA
## [1423]         NA         NA         NA         NA         NA         NA
## [1429]         NA         NA         NA         NA -18.340000  -2.500000
## [1435]         NA         NA   3.924040         NA         NA         NA
## [1441]         NA         NA         NA         NA         NA         NA
## [1447]         NA         NA         NA         NA         NA         NA
## [1453]         NA         NA         NA         NA         NA         NA
## [1459]         NA         NA         NA         NA         NA         NA
## [1465]         NA         NA         NA         NA         NA         NA
## [1471]         NA         NA         NA         NA         NA         NA
## [1477]         NA         NA         NA         NA         NA         NA
## [1483]         NA         NA         NA         NA         NA         NA
## [1489]         NA         NA         NA         NA -21.617168 -22.583766
## [1495]         NA -23.000476 -23.000481 -22.250473   3.924040         NA
## [1501]         NA         NA   3.924040         NA   9.154720   9.154720
## [1507]         NA         NA         NA         NA         NA         NA
## [1513]         NA         NA         NA         NA         NA   6.000000
## [1519]         NA   3.914510   3.914510   3.924040         NA         NA
## [1525]         NA   3.857430   4.187754         NA         NA         NA
## [1531]         NA         NA         NA         NA         NA         NA
## [1537]         NA         NA         NA         NA         NA         NA
## [1543]         NA         NA         NA         NA         NA         NA
## [1549]         NA         NA         NA         NA         NA         NA
## [1555]         NA         NA         NA         NA         NA         NA
## [1561]         NA         NA         NA         NA         NA         NA
## [1567]         NA         NA         NA         NA         NA         NA
## [1573]         NA         NA         NA         NA         NA         NA
## [1579]         NA         NA         NA         NA   4.000000   4.000000
## [1585]  -1.887268 -10.000000 -26.305575         NA -16.712000 -26.305575
## [1591]   5.209218   5.209218         NA         NA         NA         NA
## [1597]         NA         NA         NA         NA         NA         NA
## [1603]         NA         NA         NA         NA         NA         NA
## [1609]         NA         NA         NA         NA         NA         NA
## [1615]         NA         NA         NA         NA         NA         NA
## [1621]         NA         NA         NA         NA         NA         NA
## [1627]         NA         NA         NA         NA         NA         NA
## [1633]         NA         NA         NA         NA         NA         NA
## [1639]         NA         NA         NA         NA         NA         NA
## [1645]         NA         NA         NA         NA         NA         NA
## [1651]         NA         NA         NA         NA         NA         NA
## [1657]         NA         NA         NA         NA         NA         NA
## [1663]  -4.240965  -4.240965         NA         NA         NA         NA
## [1669]         NA         NA         NA         NA         NA         NA
## [1675]         NA         NA         NA         NA         NA         NA
## [1681]         NA         NA         NA         NA         NA         NA
## [1687]         NA         NA         NA         NA         NA         NA
## [1693]         NA         NA         NA         NA         NA         NA
## [1699]         NA         NA         NA         NA         NA         NA
## [1705]         NA         NA         NA         NA         NA         NA
## [1711]         NA         NA   5.189780         NA  -1.901770 -10.960990
## [1717] -16.277100 -16.277100         NA   3.878300  -1.901770 -12.499640
## [1723] -16.277100  -3.368410         NA -16.277100 -10.960990 -10.960990
## [1729]  14.978780  -2.293250  -9.428000 -10.994240 -10.994240 -10.960990
## [1735]  -9.956720  -5.239390  -4.000000   7.233330  -1.648620  -3.368410
## [1741]  -2.293250  -2.450630   5.189780   3.878300  -2.450630   3.878300
## [1747] -22.690000         NA  -1.592240 -21.975100         NA         NA
## [1753] -21.975100         NA   4.633300   8.952480  -9.428000  -9.900000
## [1759]  -8.360000         NA         NA         NA         NA         NA
## [1765]         NA         NA         NA         NA         NA         NA
## [1771]         NA         NA         NA         NA         NA         NA
## [1777]         NA         NA         NA         NA         NA         NA
## [1783]         NA  -1.084083  -4.267222  -3.422167   4.860417   4.860417
## [1789]   3.933889   6.804611  -3.106417  -3.106417         NA -10.865611
## [1795] -10.340972  -1.902056  -9.906861 -10.299083  -4.267222 -11.494944
## [1801] -11.505722 -10.340972         NA  -3.744667  -0.664917         NA
## [1807] -14.235000 -27.596917 -11.505722 -26.435528 -11.505722  -2.454944
## [1813]  -3.784611 -11.808278  -3.753361  -3.744667  -0.664917  -4.830333
## [1819]  -4.830333  -1.902056         NA  -1.902056  -1.902056         NA
## [1825]  -2.018139         NA  -1.998139  -1.902056   4.015028         NA
## [1831]         NA         NA         NA         NA         NA         NA
## [1837]         NA         NA         NA         NA         NA         NA
## [1843]         NA         NA         NA         NA         NA         NA
## [1849]         NA         NA         NA         NA         NA         NA
## [1855]         NA         NA         NA  -9.747699         NA         NA
## [1861]         NA         NA         NA         NA         NA         NA
## [1867]         NA         NA -17.457149   5.501527
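
Scrolling through almost two thousand values is a slow way to spot the gaps. As a minimal sketch, you can count them instead. The data frame `toy` and its values are made up for illustration; they are not GBIF data.

```r
# 'toy' is a hypothetical stand-in for df, with made-up coordinates
toy <- data.frame(lat = c(-22.4, NA, 8.6, NA),
                  lon = c(-42.7, NA, -83.5, -60.1))

sum(is.na(toy$lat))                      # number of missing latitudes: 2
sum(!is.na(toy$lat) & !is.na(toy$lon))   # rows that have both coordinates: 2
```

Running the same two lines on `df$lat` and `df$lon` should tell you how many rows have usable coordinates.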

We also don’t need all the columns that GBIF provides for our mapping purposes. So, a common step in any workflow is to make sure you have the cleanest dataset possible.

First, we’re going to create a new dataframe that just has a few of the columns, the ones most relevant to our project:

df <- df[,c("species","continent","country","adm1", "basisOfRecord", "lat","lon")]

Next, we’re going to remove all the occurrences that don’t have latitude and longitude data.

df  <-  subset(df,!is.na(df$lon) & !is.na(df$lat))
nrow(df) #how many data points do we have now?
## [1] 1036

Then we transform all the negative longitude values so that the range goes from 0 to 360 instead of from -180 to 180. This will allow us to plot the points on the “world2” map used below, which measures longitude from 0 to 360. We will add the result as an extra column, “lon360”, in “df” so that we can use either version.

westlongitudes  <-  which(df$lon < 0)
df[,"lon360"]  <- df[,"lon"] 
df[westlongitudes,"lon360"]  <-  360 + df[westlongitudes,"lon"] 

#Do you understand how these three lines work?
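
To see what those three lines do, here is the same logic on a tiny made-up vector (`lon` and its values are hypothetical). `which()` returns the positions of the negative values, and only those positions get 360 added. The last two lines sketch a one-step alternative with the modulo operator `%%`, which should give the same result in R.

```r
lon <- c(-42.7, 4.4, -83.5, 150)   # hypothetical longitudes on the -180 to 180 scale

west <- which(lon < 0)             # positions of the negative values: 1 and 3
lon360 <- lon                      # start with an unchanged copy
lon360[west] <- 360 + lon[west]    # shift only the negative values
lon360                             # 317.3 4.4 276.5 150.0

lon360_mod <- lon %% 360           # alternative: modulo wraps negative values around
all.equal(lon360, lon360_mod)      # TRUE
```

The step-by-step version is longer, but it makes it easy to inspect `west` and see exactly which rows were changed.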

Next, we make a simple map to look for errors:

require(maps) #load the mapping library
## Loading required package: maps
map("world2",col = "darkgray") #generate the map
map.axes() #label the axes (longitude and latitude values) 
points(df$lon360,df$lat,col = "red",pch = 20) #plot the species occurrence points

This will be easier to read if we make it so the map shows only the part of the Earth where GBIF has occurrence records for our species.

map("world2",col = "darkgray",
     xlim = range(df$lon360,na.rm = TRUE) + c(-1,1), #one extra degree on each side for visibility
     ylim = range(df$lat,na.rm = TRUE) + c(-1,1))
points(df$lon360,df$lat,col = "red",pch = 20)
map.axes()
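
The `xlim` and `ylim` arguments above rely on `range()` returning the minimum and maximum as a two-element vector, to which `c(-1,1)` adds a one-degree margin on each side. A small sketch with made-up values (`x` is hypothetical):

```r
x <- c(276.5, 317.3, NA, 290.0)             # hypothetical lon360 values with one gap

range(x, na.rm = TRUE)                      # 276.5 317.3
lims <- range(x, na.rm = TRUE) + c(-1, 1)   # subtract 1 from the min, add 1 to the max
lims                                        # 275.5 318.3
```

Writing `TRUE` in full is slightly safer than `T`, because `T` is an ordinary variable that can be overwritten.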

In the example of the blue morpho butterflies, you can see that almost all the occurrences are from the tropical parts of South and Central America, but there are also a few in Europe and Oceania. This could be because GBIF does not only track verified scientific occurrences; it also stores information on museum specimens and on community-science observations made by people. So let’s see what kinds of data points are included in your data set, and how many there are of each:

table(df[,"basisOfRecord"])
## 
##  HUMAN_OBSERVATION    MATERIAL_SAMPLE         OCCURRENCE PRESERVED_SPECIMEN 
##                540                 16                 24                456

But we don’t want to include all of these records in our range map: we’re trying to look at the actual habitat range of the living species. We should make sure we’re only dealing with live observations, not with fossil or preserved specimens. Which points are not from observations of living animals?

notobs <- which(!(df$basisOfRecord == "HUMAN_OBSERVATION" | df$basisOfRecord == "OBSERVATION" | df$basisOfRecord == "OCCURRENCE"))
map("world2",col = "darkgray",
     xlim = range(df$lon360,na.rm = TRUE) + c(-1,1),
     ylim = range(df$lat,na.rm = TRUE) + c(-1,1))
points(df$lon360,df$lat,col = "red",pch = 20)
points(df[notobs,]$lon360,df[notobs,]$lat,col = "black",pch = 21)
map.axes()

The points that aren’t from actual observations of living butterflies are circled in black. We’ll remove these points:

if (length(notobs) > 0) df <- df[-notobs,] #remove the flagged points (skipped if nothing was flagged)
nrow(df) #how many left now? 
## [1] 564
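
A caution about removing rows with a negative index: if nothing was flagged, the index vector is empty, and indexing with its negative returns a data frame with no rows at all instead of leaving the data unchanged. The sketch below uses a small made-up data frame (toy, flagged, and keep are hypothetical names) to show the problem and one way around it, which is to keep rows with a TRUE/FALSE vector instead.

```r
toy <- data.frame(id = 1:3,
                  basisOfRecord = c("HUMAN_OBSERVATION", "OCCURRENCE", "HUMAN_OBSERVATION"))
flagged <- which(toy$basisOfRecord == "PRESERVED_SPECIMEN") #nothing matches, so this is empty
nrow(toy[-flagged, ]) #every row is lost!
## [1] 0
keep <- !(seq_len(nrow(toy)) %in% flagged) #TRUE for every row that was not flagged
nrow(toy[keep, ]) #all three rows survive
## [1] 3
```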

Plot what’s left again to see whether anything still looks like it’s in the wrong place or stands out as probably incorrect:

map("world2",col = "darkgray",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20) #plot again with only the real data

Recall what you know about your species’ range. Do any of these occurrences look like they might be errors?

Data ‘cleaning’ is particularly important for data sourced from species distribution data warehouses such as GBIF. These warehouses do not gather data specifically for species distribution modeling, so you need to understand the data and clean them appropriately for your application.

My example species, the blue morpho butterfly Morpho menelaus, lives in South and Central American tropical rainforests. On that basis, the points in Northern Europe seem pretty suspicious; maybe they’re tagged incorrectly, and are actually captive individuals in a zoo, or even dead preserved specimens? Maybe someone incorrectly entered the latitude and longitude of the museum into the collection information? If you have data points in suspicious locations, take a look at them by filtering the latitude or longitude:

test1  <-  which(df$lon360 < 250) #tagging all points outside the expected range in the Americas, by longitude (adjust the cutoff for your species)
map("world2",col = "darkgray",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
points(df[test1,]$lon360,df[test1,]$lat,col = "black",pch = 21) #circle the flagged points in black

All the points we’ve identified as being in the wrong place are now circled in black. What can we find out about them?

df[test1,]
##             species continent country    adm1     basisOfRecord      lat
## 115 Morpho menelaus    EUROPE Belgium Antwerp HUMAN_OBSERVATION 51.21534
## 314 Morpho menelaus    EUROPE Belgium Antwerp HUMAN_OBSERVATION 51.21180
## 315 Morpho menelaus    EUROPE Belgium Antwerp HUMAN_OBSERVATION 51.21180
##         lon  lon360
## 115 4.42175 4.42175
## 314 4.41615 4.41615
## 315 4.41615 4.41615

These butterflies are in Antwerp, in Belgium, where there is a very famous zoo – and when I search for information about it, it appears it has a butterfly garden! I suspect these are captive specimens, so I want to exclude them from my data set.

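remove  <-  c(test1)
if (length(remove) > 0) df <- df[-remove,] #remove the incorrect points; skipped if nothing was flagged, because an empty remove would delete every row
rm(remove)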

What’s left?

map("world2",col = "darkgray",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)

Those all look like reasonable places for blue morphos to live. Keep cleaning yours until you’ve gotten rid of any other data points that make no sense.

In a longer-term research project intended for publication, you would spend a lot more time on the data cleaning step, and indeed there are programs and functions for doing exactly that, but for today let’s leave it here.

Mapping species range

Now, how should we visualize the species range? We’ll start by drawing a polygon that encloses all the points (this is called a “hull”).

require(sf); require(concaveman) #load mapping libraries
## Loading required package: sf
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
## Loading required package: concaveman
sfdata <- st_as_sf(df,coords = c("lon360","lat")) #this reformats the coordinate points into a special data structure

conc <- concaveman(sfdata,concavity = 3,length_threshold = 0) #this is called a concave hull, it's a polygon that contains all the points

conv <- convHull(df[,c("lon360","lat")]) #this is called a convex hull, it's just a polygon drawn around all the points that stick out the most; convHull() comes from the dismo package, which must be loaded
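
To see what a convex hull is without any mapping packages, here is a minimal sketch using chull() from base R on five made-up points. It is not the function used above, just an illustration of the same idea. chull() returns the positions of the points that form the corners of the hull; the point in the middle lies inside the hull, so it is left out.

```r
x <- c(0, 4, 4, 0, 2) #made-up coordinates: four corners of a square...
y <- c(0, 0, 4, 4, 2) #...and one point in the middle
corners <- chull(x, y) #positions of the points on the convex hull
sort(corners)
## [1] 1 2 3 4
plot(x, y, pch = 20, col = "red")
polygon(x[corners], y[corners], border = "black") #draw the hull around the points
```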

Then make a map that shows the concave and convex hulls:

map("world2",col = "darkgray",
     xlim = range(df$lon360,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
map.axes()
points(df$lon360,df$lat,col = "red",pch = 20)
plot(conv,add = T,col = rgb(1,1,0,0.3),lty = "blank")
plot(conc,add = T,col = rgb(1,0,0,0.3),lty = "blank")
legend("topright",col = c(rgb(1,1,0,0.3),rgb(1,0,0,0.3)),
       legend = c("convex","concave"),pch = 15,bty = "n")

This isn’t very satisfactory as a map of species range, because it takes no account of whether your species could actually live in all the places between the points you plotted, where it has not been recorded. In the next part we’ll look at some environmental data to see if we can figure out a better way.

To save your map to your class file, click Export>Save as Image. Give it a name that contains the species name and your name.

Part 4: Environmental data

Download the climatic data from the WorldClim website.

require(geodata); require(raster);require(here)
## Loading required package: geodata
## Loading required package: terra
## terra 1.8.54
## Loading required package: here
## here() starts at /Users/jblois/Documents/GitHub/biodata_shortcourse/development
climate <- worldclim_global(var = 'bio',res = 2.5,path = here())
climate <- stack(climate)

The variable climate now contains a special data structure called a “RasterStack”, which consists of some number of matrices of exactly the same dimensions. (Think of it like a neatly aligned stack of maps.)

names(climate) #these names are annoyingly long, let's rename them
##  [1] "wc2.1_2.5m_bio_1"  "wc2.1_2.5m_bio_2"  "wc2.1_2.5m_bio_3" 
##  [4] "wc2.1_2.5m_bio_4"  "wc2.1_2.5m_bio_5"  "wc2.1_2.5m_bio_6" 
##  [7] "wc2.1_2.5m_bio_7"  "wc2.1_2.5m_bio_8"  "wc2.1_2.5m_bio_9" 
## [10] "wc2.1_2.5m_bio_10" "wc2.1_2.5m_bio_11" "wc2.1_2.5m_bio_12"
## [13] "wc2.1_2.5m_bio_13" "wc2.1_2.5m_bio_14" "wc2.1_2.5m_bio_15"
## [16] "wc2.1_2.5m_bio_16" "wc2.1_2.5m_bio_17" "wc2.1_2.5m_bio_18"
## [19] "wc2.1_2.5m_bio_19"
names(climate) <- paste0("bio",1:19)
names(climate)
##  [1] "bio1"  "bio2"  "bio3"  "bio4"  "bio5"  "bio6"  "bio7"  "bio8"  "bio9" 
## [10] "bio10" "bio11" "bio12" "bio13" "bio14" "bio15" "bio16" "bio17" "bio18"
## [19] "bio19"

In the case of this climate data file that we just downloaded, those maps contain the values of 19 different climatic variables that are frequently relevant to species distributions, for all the land surface in the whole world (not the oceans).
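
The “stack of maps” idea can be sketched with a plain three-dimensional array in base R. This is only an analogy with made-up numbers (toystack is a hypothetical name); a real RasterStack also stores coordinates and other information. Each layer is a matrix of the same size, so every grid cell has one value in every layer.

```r
layer1 <- matrix(c(25, 26, 24, 27), nrow = 2) #made-up "temperature" map, 2 x 2 cells
layer2 <- matrix(c(2000, 1800, 2200, 1500), nrow = 2) #made-up "precipitation" map, same size
toystack <- array(c(layer1, layer2), dim = c(2, 2, 2),
                  dimnames = list(NULL, NULL, c("bio1", "bio12")))
dim(toystack) #rows, columns, layers
## [1] 2 2 2
cell <- toystack[1, 2, ] #the value of every layer at row 1, column 2
cell
```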

Viewing the environmental data

You can plot any one of the layers to have a look at it. Call it by its name, using the $ operator, as an argument to the plot() function.

plot(climate$bio1)

This layer, bio1, is the annual mean temperature. To see what each of the 19 bioclimatic variables means, look at https://www.worldclim.org/data/bioclim.html. In WorldClim version 2.1, which is what we downloaded (note the “wc2.1” in the original layer names), temperatures are given in degrees Celsius; the older version 1.4 files used tenths of a degree. Precipitation is in millimeters.

Then you can plot your own species occurrence data on top of it, restricting the range of the map to the range of your occurrences plus 1 degree in each direction, the same way we did in Part 3. The climate data layers report longitude as going from -180 to 180, so we have to go back to the original longitude column (“lon”, not “lon360”):

plot(climate$bio1,
     xlim = range(df$lon,na.rm = T) + c(-1,1), 
     ylim = range(df$lat,na.rm = T) + c(-1,1))
points(df$lon,df$lat,col = "red",pch = 20)
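
The two longitude conventions differ only for western longitudes. The sketch below assumes that lon360 was made by wrapping the negative values of lon onto a 0 to 360 scale (check how the column was created earlier in this tutorial); the coordinates are made up.

```r
lon_example <- c(-60.5, -179, 4.4, 120) #made-up longitudes on the -180 to 180 scale
lon360_example <- lon_example %% 360 #wrap onto the 0 to 360 scale
lon360_example
## [1] 299.5 181.0   4.4 120.0
back <- ifelse(lon360_example > 180, lon360_example - 360, lon360_example) #and back again
back
```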


Thinking about it

Overlay your species occurrences with each of the different bioclimatic data layers in climate (bio1 through bio19).

  • Do any of the bioclimatic variables seem to be important in controlling the range of your species?

  • If so, which ones? Save the images to your class folder for later reference.

  • What do you think about this?

    • Are you surprised by the results?

    • Can you think of a reason why these particular climatic variables might have a lot to do with the possible range of your species?
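
If you would rather not retype the plotting commands nineteen times, a loop is one way to do it. This is only a sketch: padded_range() and plot_layer() are hypothetical helper names made up for this example, and the loop assumes that climate and df are the objects built earlier in this tutorial. It does nothing if they are not in your workspace.

```r
#hypothetical helper: the range of the coordinates plus some extra degrees on each side
padded_range <- function(x, pad = 1) range(x, na.rm = TRUE) + c(-pad, pad)

#hypothetical helper: draw layer number i with the occurrence points on top
plot_layer <- function(climate, df, i) {
  plot(climate[[i]], main = names(climate)[i],
       xlim = padded_range(df$lon),
       ylim = padded_range(df$lat))
  points(df$lon, df$lat, col = "red", pch = 20)
}

#only runs when the tutorial objects exist (R has a built-in function called df, so we check for a data frame)
if (exists("climate") && is.data.frame(get0("df"))) {
  for (i in 1:19) plot_layer(climate, df, i)
}
```

In RStudio you can then use the arrows in the Plots panel to page back through the 19 maps.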

Tomorrow we’ll develop a quantitative model with these data to answer these questions!

Part 5. Saving your data

Save your data so you can load it again tomorrow. This is not straightforward on UC Merced computer lab computers, so please follow ALL of the following steps:

  1. Choose Session>Save Workspace As…
  2. In the popup window, choose Documents from the list on the left side under Quick Access.
  3. Give the file a UNIQUE name with YOUR NAME in it and click Save. For example, a good file name would be: Blois_Day1.RData (but replace “Blois” with your name!)

Your instructor will make sure these files are here for you to load tomorrow morning.